Downstream Analysis of Synteny Networks with MSA2dist

Kristian K Ullrich

Max Planck Institute for Evolutionary Biology

Background - DNA Evolution

A matter of distance.

Background - Coding Sequences

Synonymous and non-synonymous substitutions: nucleotide substitutions might lead to changes on amino acid level.

Typically used to:

  1. Detect and date whole-genome duplication (WGD) events;
  2. Predict selective pressure acting on a protein;
  3. Predict selective pressure acting on a codon;
  4. Investigate codon usage (R/Bioconductor package coRdon).

MSA2dist

An R/Bioconductor package to calculate pairwise distances between all sequences of a MultipleSequenceAlignment (DNAStringSet or AAStringSet) and to conduct codon based analysis.

MSA2dist Features:

  1. Calculates pairwise AA, DNA and IUPAC distances;
  2. Calculates Synonymous and NonSynonymous Substitutions (dN/dS) KaKs_Calcultor 2.0 models (C++ ported to R with Rcpp);
  3. Use pre-calculated MSA or conduct all possible pairwise alignments prior dN/dS calculation;
  4. Calculate average behavior of each codon from MSA.

MSA2dist workflow

Installation

From Bioconductor:

BiocManager::install(version='devel')
BiocManager::install("MSA2dist")


From GitHub:

remotes::install_github("kullrich/MSA2dist")

Example data set

Data: Whole-genome coding sequences and gene ranges for Chlorophyta algae + Physcomitrium patens as an outgroup.

Source: Pico-PLAZA 3.0

Where to find: the data/ directory in the GitHub repo associated with this presentation.

data
├── annotation.rda
├── cds.rda
├── clusters.rda
├── data_acquisition.Rmd
├── diamond_list.rda
├── profiles.rda
└── proteomes.rda
  1. cds.rda: DNAStringSet object.
  2. clusters.rda: Pre-calculated syntenet clusters.

Example data set: cds

# Load necessary libraries
library(MSA2dist)
library(Biostrings)
library(tidyr)
library(dplyr)
library(stringr)

# Load data set cds
load(here::here("data", "cds.rda"))

# Inspect data
head(names(cds))
[1] "CNC64A_028G00030" "CNC64A_028G00040" "CNC64A_028G00070" "CNC64A_028G00240"
[5] "CNC64A_028G00150" "CNC64A_028G00170"

Example data set: clusters

# Load data set clusters
load(here::here("data", "clusters.rda"))

# Inspect data
head(clusters)
                   Gene Cluster
1 Chlo_CNC64A_028G00030       1
2 Chlo_CNC64A_028G00040       2
3 Chlo_CNC64A_028G00070       3
4 Chlo_CNC64A_028G00080       4
5 Chlo_CNC64A_028G00110       5
6 Chlo_CNC64A_028G00140       6
length(table(clusters$Cluster))
[1] 22605

Fetch coding sequences for a specific cluster

# Get size distribution of clusters
head(table(clusters$Cluster))

 1  2  3  4  5  6 
38 38 37  2  9  4 
# Get first cluster of size 10 (fc10)
fc10 <- which(table(clusters$Cluster)==10)[1]

my_cluster <- clusters %>%
    dplyr::filter(Cluster==fc10)

head(my_cluster)
                      Gene Cluster
1          Oluc_OL13G02120     119
2 Bpra_BPRRCC1105_04G02530     119
3     Osp_ORCC809_13G00870     119
4          Oluc_OL21G00860     119
5         Omed_OM_08G02190     119
6     Msp_MRCC299_06G03520     119

Fetch coding sequences for a specific cluster

# Remove species unique identifier and add as GeneID
my_cluster$GeneID <- stringr::str_split_fixed(my_cluster$Gene, "_", 2)[, 2]

head(my_cluster)
                      Gene Cluster              GeneID
1          Oluc_OL13G02120     119          OL13G02120
2 Bpra_BPRRCC1105_04G02530     119 BPRRCC1105_04G02530
3     Osp_ORCC809_13G00870     119    ORCC809_13G00870
4          Oluc_OL21G00860     119          OL21G00860
5         Omed_OM_08G02190     119         OM_08G02190
6     Msp_MRCC299_06G03520     119    MRCC299_06G03520
# Fetch coding sequences
my_cds <- cds[match(my_cluster$GeneID, 
    names(cds))]

head(my_cds)
DNAStringSet object of length 6:
    width seq                                               names               
[1]  1605 ATGTCGCTCAAGTCCATCAACCC...GCGTGAACATGCGCAAGCGATGA OL13G02120
[2]  1623 ATGTCCTCCTCTCTCCGTTCGGT...GCGGTCGCGGACCGGCTGAATAA BPRRCC1105_04G02530
[3]  1605 ATGTCGCTCAAATCGATCAACCC...GCATCAACATGCGAAAGAAATAG ORCC809_13G00870
[4]  1605 ATGTCGCTCAAGTCCATCAACCC...GCGTGAACATGCGCAAGCGATGA OL21G00860
[5]  1605 ATGAGTTTGAAATCCGTCAACCC...GGGTGAACATGCGCAAACGATAA OM_08G02190
[6]  1707 ATGCAGGGCATGCAAAGCCCCAT...CTCCCGAAAAACTGGGGGAGTAA MRCC299_06G03520

Coding sequence alignment

To calculate a coding sequence alignment for two sequences:

# Get coding sequence alignment
cds1_cds2_aln <- cds2codonaln(cds1 = my_cds[1], cds2 = my_cds[2], remove.gaps = FALSE)
cds1_cds2_aln
DNAStringSet object of length 2:
    width seq                                               names               
[1]  1623 ATGTCG------CTCAAGTCCAT...GCAAGCGA------------TGA OL13G02120
[2]  1623 ATGTCCTCCTCTCTCCGTTCGGT...GCGGTCGCGGACCGGCTGAATAA BPRRCC1105_04G02530

To calculate dN/dS from this alignment:

# Calculate dN/dS for this alignment; model = Li
cds1_cds2_kaks <- dnastring2kaks(cds1_cds2_aln,
    model = "Li", threads = 1, isMSA = TRUE)
cds1_cds2_kaks
  Comp1 Comp2       seq1                seq2        ka       ks          vka
1     1     2 OL13G02120 BPRRCC1105_04G02530 0.1640926 1.849032 0.0003232055
        vks
1 0.4047548

Other Biostrings::pairwiseAlignment options can be set:

type = "global", substitutionMatrix = "BLOSUM62"
gapOpening = 10, gapExtension = 0.5

Parallelized pairwise coding sequence alignments

To calculate all pairwise combinations:

start_time <- Sys.time()
# Calculate dN/dS for the whole cluster; model = YN
my_cds_kaks <- dnastring2kaks(my_cds,
    model = "YN", threads = 2, isMSA = FALSE)
end_time <- Sys.time()

head(my_cds_kaks, n = 4)
         Comp1 Comp2       seq1                seq2 Method        Ka      Ks
result.1     1     2 OL13G02120 BPRRCC1105_04G02530     YN  0.145529 4.33579
result.2     1     3 OL13G02120    ORCC809_13G00870     YN 0.0328853 2.73769
result.3     1     4 OL13G02120          OL21G00860     YN        NA      NA
result.4     1     5 OL13G02120         OM_08G02190     YN 0.0570134 4.34114
             Ka/Ks P-Value(Fisher) Length S-Sites N-Sites Fold-Sites(0:2:4)
result.1 0.0335645    3.77014e-136   1602 324.101  1277.9                NA
result.2 0.0120121    3.37981e-125   1602 313.774 1288.23                NA
result.3        NA              NA   1602 313.942 1288.06                NA
result.4 0.0131333    1.79968e-161   1602 326.422 1275.58                NA
         Substitutions S-Substitutions N-Substitutions
result.1           443         274.184         168.816
result.2           238          196.59         41.4101
result.3             0              NA              NA
result.4           326         255.985         70.0148
         Fold-S-Substitutions(0:2:4) Fold-N-Substitutions(0:2:4)
result.1                          NA                          NA
result.2                          NA                          NA
result.3                          NA                          NA
result.4                          NA                          NA
         Divergence-Time Substitution-Rate-Ratio(rTC:rAG:rTA:rCG:rTG:rCA/rCA)
result.1        0.993263                              1.65453:1.65453:1:1:1:1
result.2        0.562657                              2.66376:2.66376:1:1:1:1
result.3              NA                                          2:2:1:1:1:1
result.4        0.929943                              1.39012:1.39012:1:1:1:1
                                    GC(1:2:3) ML-Score AICc Akaike-Weight Model
result.1 0.537893(0.536969:0.358595:0.718115)       NA   NA            NA    NA
result.2 0.561682(0.562617:0.349533:0.772897)       NA   NA            NA    NA
result.3  0.561371(0.568224:0.35514:0.760748)       NA   NA            NA    NA
result.4   0.542368(0.561682:0.35514:0.71028)       NA   NA            NA    NA

How long did it take?

end_time - start_time
Time difference of 4.759112 secs

Transform Ka values into a distance matrix

ka_mat <- my_cds_kaks %>%
    dplyr::select(seq1, seq2, Ka) %>%
    dplyr::mutate(Ka = as.numeric(Ka)) %>%
    tidyr::pivot_wider(names_from = "seq1", values_from = Ka)

head(ka_mat)
# A tibble: 6 × 10
  seq2       OL13G02120 BPRRCC1105_04G0… ORCC809_13G00870 OL21G00860 OM_08G02190
  <chr>           <dbl>            <dbl>            <dbl>      <dbl>       <dbl>
1 BPRRCC110…     0.146            NA              NA         NA           NA    
2 ORCC809_1…     0.0329            0.151          NA         NA           NA    
3 OL21G00860    NA                 0.146           0.0329    NA           NA    
4 OM_08G021…     0.0570            0.146           0.0719     0.0570      NA    
5 MRCC299_0…     0.673             0.699           0.653      0.673        0.715
6 MP09G03570     0.666             0.696           0.646      0.666        0.678
# … with 4 more variables: MRCC299_06G03520 <dbl>, MP09G03570 <dbl>,
#   MP09G02740 <dbl>, OT_13G02270 <dbl>

Transform Ka values into a distance matrix

# Extract sequence names
seq_names <- c(colnames(ka_mat)[-1], rev(ka_mat$seq2)[1])

# Add NA to fill into square distance matrix
ka_mat <- NA |> rbind(ka_mat[, -1]) |> cbind(NA) |>
    tidyr::as_tibble()

# Set column names
colnames(ka_mat) <- seq_names

head(ka_mat)
# A tibble: 6 × 10
  OL13G02120 BPRRCC1105_04G02530 ORCC809_13G00870 OL21G00860 OM_08G02190
       <dbl>               <dbl>            <dbl>      <dbl>       <dbl>
1    NA                   NA              NA         NA           NA    
2     0.146               NA              NA         NA           NA    
3     0.0329               0.151          NA         NA           NA    
4    NA                    0.146           0.0329    NA           NA    
5     0.0570               0.146           0.0719     0.0570      NA    
6     0.673                0.699           0.653      0.673        0.715
# … with 5 more variables: MRCC299_06G03520 <dbl>, MP09G03570 <dbl>,
#   MP09G02740 <dbl>, OT_13G02270 <dbl>, MRCC299_06G02640 <lgl>

Use distance matrix to plot bionjs tree

ape::bionjs reconstructs a phylogenetic tree from a distance matrix with possibly missing values:

# Calculate and plot bionjs tree
(ka_mat |> as.dist() |> ape::bionjs()) |>
    plot(cex=1)

Calculate average behavior of each codon

Here, a pre-calculated MSA is necessary:

# load example data
data(hiv, package="MSA2dist")

# calculate average behavior
hiv.xy <- hiv |> dnastring2codonmat() |> codonmat2xy()

Plot average behavior of each codon:

Here’s where you can find me: